This workshop mostly follows a wonderful tutorial created by Claudia Engel, an
academic technology specialist and lecturer in Anthropology at Stanford
University. However, I’ve edited Engel’s tutorial to place more emphasis
behind the syntax logic of ggplot.
Welcome to the Yale StatLab workshop series, where we will be
covering data visualization using a tool called ggplot,
which is part of the tidyverse package. When I first learnt
ggplot on my own, I was quite confused — I came from coding
in MATLAB, which was very literal (for instance, if you wanted to plot a
sine wave with the equation y = sin(t), it was simply
plot(t,y).
The point of me saying all of this is that if you want to express
more complicated ideas, being literal sometimes does not cut it. If a
good novel only included terse very manner-of-fact dialogues, you would
immediately drop the book. Akin to English, we need a set of rules, a
structure of sorts that helps build up a language. This set of “rules”
is called grammar, and as you will notice with ggplot,
there are a lot of idiosyncrasies that might require some blind faith in
the beginning, but upon mastery you will begin to see the power of
ggplot.
Today, we’ll be working with a dataset from the Stanford Open Policing Project, a project that gathers data across the county on traffic stops made by police officers. In particular, we’ll be looking at police stops in Missippi over a span of three years.
We will use the read_csv function,
which allows for us to define the class types of the data we want read
in.
# You will need to load up tidyverse. If you do not have the package, use the command install.packages("tidyverse")
library(tidyverse)
# Read the CSV
bias <- read_csv("https://raw.githubusercontent.com/cengel/R-data-viz/master/data/MS_stops_by_county.csv")
## Rows: 82 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): county_name, bias
## dbl (8): county_fips, driver_race_Black, driver_race_White, black_pop, white...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
It’s always a good habit to do a quick scan of the data to contextualize what’s happening.
head(bias, n = 2)
What’s we’re looking at is a dataset of policing by counties in Mississippi, with some statistics about the number of Black and White drivers being pulled over. The research question from this dataset is whether there is any racial bias in police stopping – perhaps we want to plot some sort of scatterplot comparing the percent of Black stopped vs percent of White stopped.
Before we jump into the coding, the first thing to know about
ggplot is that it works in layers, sort of like a cake! The
bottom layer is your data, and you can layers things on top to make an
increasingly sophisticated graph.
As mentioned, the “base” of your cake is the data.
ggplot(data = bias)
If you just plotted this and pressed run, you will be greeted by a grey screen on the R-studio plot. That’s good! We’ve set the “base” for the cake.
Okay, let’s start adding an argument called aes.
aes refers to aesthetic mapping, which is the idea of
linking variables in the data to graphical properties in the geometry.
If you’re working in a cartesian coordinate system, for example, this
tells us where the datapoint links exactly on the plane.
Let’s add a few more arguments.
ggplot(data = bias, aes(x = pct_black_stopped, y = pct_white_stopped))
Two things here. First, notice that our arguments for the x and y
axis (i.e., pct_black_stopped and
pct_white_stopped) do not have quotation marks on it. This
shouldn’t come as a surprise to us, especially for those who joined my
session last week on Intermediate R, where we introduced the pipe
operator. Remember that ggplot is under the
tidyverse universe (no pun intended), similar to how
dplyr is part of the tidyverse universe.
The second question you might ask is “where is my data”? Again,
remember we’ve just mapped the data onto the plane, but we
haven’t told ggplot what the graph we want. This is where
the fancy stuff begins to come into play. For a scatterplot, where we
just want dots or points on the graph, we use the function
geom_point().
ggplot(data = bias, aes(x = pct_black_stopped, y = pct_white_stopped)) +
geom_point()
Notice the + sign that we had to use, which means that
we’re adding an additional layer to our “cake”. One thing to note is
that the + must be added on the same line of your layer –
if you did something like this, for example, it would not work:
# This does not work
ggplot(data = bias, aes(x = pct_black_stopped, y = pct_white_stopped))
+ geom_point()
It also doesn’t really matter if the aes argument goes
into the base layer or to the geom_point layer.
ggplot(data = bias) +
geom_point(aes(x = pct_black_stopped, y = pct_white_stopped))
However, for consistency (and readability), we will stick with
putting the aes argument in the data layer.
You can play around with the geom_point argument — for
instance, you don’t want circles on your graph, but want squares
instead. And while we’re at it, you, the picky customer, might think the
squares are too dark, so you would like to change the opacity.
ggplot(data = bias, aes(x = pct_black_stopped, y = pct_white_stopped)) +
geom_point(shape = 'square', alpha = 0.3) # Notice that we need to put quotations on 'square', as the argument 'square' is not piped from our dataset
Anyway, let’s get back to our original research question, which is to see if there is any bias in police stops. I’m not sure if you’ve seen graphs scatterplots where they plot the female literacy rates against male literacy rates and each “dot” represents a country, or other similar scatterplots where a diagonal line cuts through the axis coordinates to demonstrate some kind of imbalance. But the idea here is that we plot a 45\(^\circ\) degree reference line (or a slope of m = 1 because y = x) on the scatterplot. If the dots fall on the reference line, this means that there is no bias (e.g., in a county where 30% of Whites are stopped and 30% of Blacks are also stopped, the dot would fall on the reference line).
We can achieve this with the geom_abline function, which
draws a line from a to b.
ggplot(data = bias, aes(x = pct_black_stopped, y = pct_white_stopped)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = 'red')
Hmmm, something seems to be off, as what I wanted was the line to cut
diagonally on the plane. This is because the y and x axis are scaled
differently - if we wanted the slope to cut diagonally, the two axis
have to be the same. We can do this using the
scale_x_continuous and scale_y_continous
argument.
ggplot(data = bias, aes(x = pct_black_stopped, y = pct_white_stopped)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = 'red') +
scale_x_continuous(limits = c(0, 1)) +
scale_y_continuous(limits = c(0, 1))
## Warning: Removed 1 rows containing missing values (geom_point).
Let’s say we were doing some exploratory data analysis and wanted to see data points where less than 10% of White drivers were stopped. Let’s trying highlighting these data points by changing the color on our graph.
ggplot(data = bias, aes(x = pct_black_stopped, y = pct_white_stopped, color = pct_white_stopped < 0.1)) +
geom_abline(slope = 1, intercept = 0, color = 'red') +
scale_x_continuous(limits = c(0, 1)) +
scale_y_continuous(limits = c(0, 1)) +
geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).
You might have initially thought that the color argument
would go into the geometries layer. However, this is not the case,
because what we’re conceptually doing is creating a new column
in our dataset, creating a TRUE/FALSE statement (i.e., whether the
percent of Whites stopped is greater than 0.1), and then mapping it onto
the plane using the geom_point function.
Similarly, if we wanted to change something about the properties of
the dots themselves (say we wanted to change the color), we would put
our arguments into the geom_point function.
ggplot(data = bias, aes(x = pct_black_stopped, y = pct_white_stopped)) +
geom_point(color = 'steelblue')
Interesting! Just by visual inspection we see that there’s something
interesting going on there, as a lot of points seem to be aggregated
below the reference line. We can “zoom in” by tweaking
scale_x_continuous and scale_y_continuous:
ggplot(data = bias, aes(x = pct_black_stopped, y = pct_white_stopped)) +
geom_abline(slope = 1, intercept = 0, color = 'red') +
scale_x_continuous(limits = c(0, 0.1)) +
scale_y_continuous(limits = c(0, 0.1)) +
geom_point()
## Warning: Removed 44 rows containing missing values (geom_point).
Notice that we also get a warning that some values are cut of because we have essentially cropped our graph.
Really, we’ve covered the hardest part of the workshop, which is
getting used to the language and syntax of ggplot. Once we
know how the toolkit belt works, what we’re really doing now is adding
more and more tools into the belt. So give yourselves a pat in the
back.
While the scatterplot gives us a good overview, we might want to look
at each of the counties in Mississippi. We can do this by finding the
difference between pct_white_stopped and
pct_black_stopped (i.e.,
pct_white_stopped - pct_black_stopped), and to see whether
there is bias in police stops at each county. This difference has been
calculated and stored under the variable wb_delta, where a
positive \(\Delta\) indicates a bias
towards stopping Whites.
ggplot(data = bias, aes(x = county_name, y = wb_delta)) +
geom_col()
Clearly, we can’t distinguish one county from another county - it’s not readable at all. We can fix this problem by flipping the coordinates:
ggplot(data = bias, aes(x = county_name, y = wb_delta)) +
geom_col() +
coord_flip()
Since most people read from top to bottom, we might want to flip the
ordering of county_name so that Adams County is on the
bottom (or further top of the screen — but in graphics the
coordinate system is flipped on the y-axis). Unfortunately, there is no
intuitive solution (such as adding a negative sign before the
county_name). You must use the function
scale_x_discrete and reverse the limits. Note that this
will not work if the x variable is continuous.
ggplot(data = bias, aes(x = county_name, y = wb_delta)) +
geom_col() +
coord_flip() +
scale_x_discrete(limits = rev)
Let’s try doing the same thing for the continuous variable
wb_delta, which requires some knowledge of the
idiosyncrasies of ggplot functions. Again, stackoverflow
and Google is your friend here.
ggplot(data = bias, aes(x = county_name, y = wb_delta)) +
geom_col() +
coord_flip() +
scale_y_reverse(limits = c(1,-1)) # You could also just use scale_y_reverse()
We can also reorder the counties by the wb_delta values,
from lowest (indicating Black bias) to highest (indicating White
bias).
ggplot(bias, aes(x = reorder(county_name, -wb_delta), y = wb_delta)) +
geom_col() +
coord_flip()
Before I forget to mention, we could also simplify the code a bit
(which I should have done previously, but for illustrative purposes I
copied everything out) by assigning everything before
geom_col() to p.
p <- ggplot(bias, aes(x = reorder(county_name, -wb_delta), y = wb_delta)) +
geom_col()
p + coord_flip()
Okay, so I think we’ve exhausted this dataset - let’s work with an
extension of the police stops dataset which we will call
reasons.
reasons <- read_csv("https://github.com/cengel/R-data-viz/raw/master/data/MS_stops.csv")
## Rows: 211211 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): id, state, county_name, police_department, driver_gender, driver_...
## dbl (3): county_fips, driver_age, y_day
## date (2): stop_date, driver_birthdate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Again, it’s a good habit to look at both the spec output
from read_csv and some techniques I covered last week to
get a good overview of the data.
head(reasons, n = 2)
Another way to get to know your data is to build a bar graph,
especially when we’re working with categorical variables. This can be
done with the geom_bar function.
ggplot(data = reasons, aes(x = violation)) +
geom_bar()
Notice that we don’t need to specific a y argument for
aes, which makes sense because we are trying to tabulate
the different violation counts. What’s happening here is that the
violation variable gets passed to geom_bar(),
which basically does the job of group_by and
count. This is something we covered last week:
reasons %>% group_by(violation) %>% count()
Generally, ggplot has a shortcut construction that can
do what we just illustrated (in the above example, it was
geom_bar). But this is not always the case, and as
such, it’s good to know functions like after_stat, where
you can manually tweak the calculations. For instance, I might be
interested in getting the percentage of each of the violations:
# Convert to percentage
ggplot(data = reasons, aes(x = violation, y = after_stat(count/ sum(count) * 100))) +
geom_bar()
We could also change the colors of the bar — but instead of using the
color function (which just creates a green border for each
bar), we can use fill.
ggplot(data = reasons, aes(x = violation)) +
geom_bar(fill = "green")
But just changing the colors is not that insightful. Say, for instance, we wanted to know how many of speeding violations were incurred by Whites compared to Blacks.
ggplot(data = reasons, aes(x = violation)) +
geom_bar(aes(fill = driver_race))
Here the aes function needs to be called again (i.e.,
geom_bar(fill = driver_race) won’t work). This is similar
to the earlier example, where we wanted to change the
pct_white_stopped colors after we defined a certain
threshold. We could just have included the fill argument in
our data layer.
ggplot(data = reasons, aes(x = violation, fill = driver_race)) +
geom_bar()
One more point on this. We could also tell ggplot to stretch the bars between [0,1], and find the proportions of violations that are grouped by race.
# Convert to percentage
ggplot(data = reasons, aes(x = violation)) +
geom_bar(aes(fill = driver_race), position = "fill")
We’ll now do a gear shift into plotting time series data, which is dear and near to me as I’ve worked on analyzing COVID datasets. In this example, we’re interested in knowing what the type of violations occur on a given day of the week.
To do so, we want to split our datasets into days of the week, and
then split that dataset based on violations. We should now be familiar
with the handy group_by function.
dow_violations <- reasons %>%
group_by(wk_day, violation) %>%
count()
head(dow_violations, n=6)
## # A tibble: 6 × 3
## # Groups: wk_day, violation [6]
## wk_day violation n
## <chr> <chr> <int>
## 1 Fri Breaks-Lights-etc 564
## 2 Fri Careless driving 1150
## 3 Fri License-Permit-Insurance 6603
## 4 Fri Other or unknown 2603
## 5 Fri Seat belt 3948
## 6 Fri Speeding 20767
Now, let’s try plotting it out.
ggplot(dow_violations, aes(x = wk_day, y = n)) +
geom_line()
Woah! This wasn’t what we wanted! What happened here is that
ggplot displayed the range of values for each day of the
week (on Friday, for example, we see the range to be [564
(Breaks-Lights-etc), 20767 (Speeding)]. Let’s fix this by adding the
group argument.
ggplot(dow_violations, aes(x = wk_day, group = violation, y = n)) +
geom_line()
That’s better. But we don’t know which line corresponds to which
violation. To do so, we need to use the color function.
ggplot(dow_violations, aes(x = wk_day, group = violation, y = n, color = violation)) +
geom_line()
If you notice the days of the week, they are a bit messed up. As
we’ve seen a few times before, R usually arranges categorical variables
(even date class) by alphabetical order. To fix this problem, we need to
manually specify what the order should be by forcing it as a
factor. Remember that whenever you need variables to be
arranged in a particular order, use factors as factors not only store
the variables of interest, but assigns an index to them.
dow_violations$wk_day <- factor(dow_violations$wk_day,
levels=c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
Let’s give this another try.
ggplot(dow_violations, aes(x = wk_day, group = violation, y = n, color = violation)) +
geom_line()
This is a great segue to facet_wrap, our last subject of
the day. Imagine you wanted separate panels for each violation
classification. What faceting essentially does is split one big plot
into multiple smaller plots, making the plot less confusing and easier
to interpret for the viewer.
Let’s use the facet_wrap function.
ggplot(dow_violations, aes(x = wk_day, y = n, group = violation)) +
geom_line() +
facet_wrap(~ violation)
Notice the ~ tilde. The tilde can be read as “by” -
semantically, this would be equivalent to saying
“I want to make new graphs separated by the cut categories”.
We can force the facet wrap to have 2 columns, i.e., a 3x2 grid.
Let’s also rename the y label to something else other than
n and remove the x label because the day of week is pretty
intuitive.
p <- ggplot(dow_violations, aes(x = wk_day, y = n, group = violation)) +
geom_line() +
facet_wrap(~ violation, ncol = 2) +
labs(title = 'Observed violations per day of week',
x = "", # Use "" to remove text on label
y = "Number of violations")
Lastly, you can do a small touch of customization to the facet wrapped graphs using different themes.
p + theme_bw()
\(\underline{Acknowledgements}\)
As mentioned in the preamble, I relied mainly on Claudia Engel’s R
tutorial series.
Another useful resource I found was Thomas Lin Pedersen’s two-part
Youtube tutorial, which
helped lay a theoretical foundation on the syntax logic behind
ggplot.